First Example

The first example uses the program at the bottom of page 315 (with the ADDD replaced by MULTD). The program is shown below.


\begin{assembly}
\instr{}{ld}{f2,a}{}
\instr{}{add}{r1,r0,xtop}{}
\instr{loop:}{...
...nch delay slot}
\instr{}{trap}{\char93 0}{; terminate simulation}
\end{assembly}

The simulator is invoked by typing dlxsim at the system prompt.

% dlxsim

First the datafile is loaded, using the load command:

load fdata.s

Next, the program may be loaded. The program above was created with an editor and saved in the file f1.s. It is loaded in the same way as the datafile.

load f1.s

To verify that the program has been loaded, the get command can be used to examine memory. The program is loaded at location 256 by default. The second parameter to get indicates how many words to dump. The i suffix tells get to dump the contents in instruction format (i.e. produce a disassembly).

get 256 9i

start:	ld f2,a(r0)
start+0x4:	addi r1,r0,0xe0
loop:	ld f0,a(r1)
loop+0x4:	multd f4,f0,f2
loop+0x8:	sd a(r1),f4
loop+0xc:	subi r1,r1,0x8
loop+0x10:	bnez r1,loop
loop+0x14:	nop
loop+0x18:	trap 0x0
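
The labels in the dump follow directly from the load address and the instruction width. As a rough illustration (not part of dlxsim), the sketch below recomputes them, assuming one 4-byte word per instruction and the default load address of 256; the helper name `dump_labels` is hypothetical.

```python
WORD_BYTES = 4  # each DLX instruction occupies one 32-bit word

def dump_labels(base=256, count=9, loop_index=2):
    """Return (address, label) pairs for `count` words starting at `base`.

    `loop_index` is the position of the instruction labeled "loop"
    (the third instruction in the listing above).
    """
    loop_addr = base + loop_index * WORD_BYTES
    rows = []
    for i in range(count):
        addr = base + i * WORD_BYTES
        # Addresses at or past the loop label are shown relative to it.
        name, origin = ("loop", loop_addr) if addr >= loop_addr else ("start", base)
        offset = addr - origin
        rows.append((addr, name if offset == 0 else f"{name}+{offset:#x}"))
    return rows

for addr, label in dump_labels():
    print(addr, label)
```

The nine words therefore span byte addresses 256 through 288, with the loop label at 264.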

To make sure that the statistics are all cleared (as they should be when the simulator is first invoked), use the stats command with the relevant parameters:

stats stalls branch pending hw

Memory size: 65536 bytes.

Floating Point Hardware Configuration
 1 add/subtract units, latency =  2 cycles
 1 divide units,       latency = 19 cycles
 1 multiply units,     latency =  5 cycles
Load Stalls = 0
Floating Point Stalls = 0

No branch instructions executed.

Pending Floating Point Operations:
none.

The hw specifier causes the memory size and floating point hardware information to be dumped. The stalls specifier causes the total load stalls and floating point stalls to be displayed. The branch specifier causes the branch information (taken vs. not taken) to be displayed; in this case no branches have been executed yet. Finally, the pending specifier causes the pending operations in the floating point units to be displayed (none in this case).

Below, the first four instructions are executed using the step command:

step 256

stopped after single step, pc = start+0x4: addi r1,r0,0xe0

step

stopped after single step, pc = loop: ld f0,a(r1)

step

stopped after single step, pc = loop+0x4: multd f4,f0,f2

step

stopped after single step, pc = loop+0x8: sd a(r1),f4

The stats command can produce some more interesting results at this point.

stats stalls pending

Load Stalls = 1
Floating Point Stalls = 0

Pending Floating Point Operations:
multiplier   #1 :  will complete in  4 more cycle(s)  87.964594 ==> F4:F5

A load stall occurred between the third and fourth instructions because of the F0 dependency. The multiply instruction has issued, and is being processed in multiplier unit #1. It will complete and store the double precision value 87.96 into F4 and F5 in four more clock cycles.

The double precision value in F4 can be displayed using the fget command with a d specifier (for double precision).

fget f4 d

f4:	0.000000

As expected, F4 hasn't received its value yet. Executing one more instruction will change the statistics:

step

stopped after single step, pc = loop+0xc: subi r1,r1,0x8

stats stalls pending

Load Stalls = 1
Floating Point Stalls = 4

Pending Floating Point Operations:
none.

Since the SD instruction used the result from the multiply instruction, the multiply was completed before the SD was executed. The four floating point stalls required for the multiply to complete were recorded as well. If F4 is examined now, its value after the writeback is displayed.

fget f4 d

f4:	87.964594
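
The stall count seen in this step can be reproduced with a very small model. The sketch below is not how dlxsim is implemented. It assumes that the issue cycle counts toward a unit's latency (so the 5-cycle multiply reports 4 cycles remaining right after it issues, as in the transcript) and that a dependent instruction waits out whatever remains. The names `PendingOp` and `stalls_for` are hypothetical.

```python
class PendingOp:
    """One in-flight floating point operation."""

    def __init__(self, dest, latency):
        self.dest = dest
        # Assumption: the issue cycle is part of the latency, which matches
        # "will complete in 4 more cycle(s)" for the 5-cycle multiplier.
        self.remaining = latency - 1

def stalls_for(source_regs, pending):
    """Cycles the next instruction waits for the registers it reads."""
    waits = [op.remaining for op in pending if op.dest in source_regs]
    return max(waits, default=0)

pending = [PendingOp("f4", latency=5)]   # multd f4,f0,f2 has just issued
print(stalls_for({"f4"}, pending))       # sd a(r1),f4 reads f4
print(stalls_for({"r1"}, pending))       # an instruction that ignores f4
```

Under these assumptions the store waits four cycles, which agrees with the Floating Point Stalls count reported above, while an instruction that does not read F4 would not stall.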

To execute the program to completion, the go command can be used. When the TRAP instruction is detected, the simulation will stop.

go

TRAP #0 received

To view the cumulative stall and branch information, the stats command can be used.

stats stalls branch

Load Stalls = 28
Floating Point Stalls = 112

Branches:  total 28, taken 27 (96.43%), untaken 1 (3.57%)

The loop executed 28 times. There was a single load stall per iteration, for a total of 28 load stalls. There were 4 floating point stalls per iteration, for a total of 112 floating point stalls. Finally, the conditional branch at the bottom of the loop was taken 27 times and fell through on the final iteration. All of these statistics are reflected above.
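
These totals can be checked by hand or with a few lines of arithmetic. The sketch below is only a cross-check, not simulator code. It takes the initial value of R1 (0xe0) and the decrement (8) from the disassembly, and the per-iteration stall counts from the single-step session. The function name `loop_stats` is hypothetical.

```python
XTOP = 0xE0   # initial r1, from "addi r1,r0,0xe0"
STEP = 8      # "subi r1,r1,0x8": one double word per pass

def loop_stats(load_stalls_per_iter=1, fp_stalls_per_iter=4):
    """Recompute the cumulative statistics for the whole run."""
    iterations = XTOP // STEP      # r1 takes the values 224, 216, ..., 8
    taken = iterations - 1         # bnez falls through once r1 reaches 0
    return {
        "iterations": iterations,
        "load_stalls": iterations * load_stalls_per_iter,
        "fp_stalls": iterations * fp_stalls_per_iter,
        "taken": taken,
        "untaken": iterations - taken,
        "taken_pct": round(100 * taken / iterations, 2),
        "untaken_pct": round(100 * (iterations - taken) / iterations, 2),
    }

for name, value in loop_stats().items():
    print(f"{name}: {value}")
```

The results (28 iterations, 28 load stalls, 112 floating point stalls, 27 taken branches at 96.43%) match the stats output.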

To verify the program operated properly, the memory locations containing the original data can be examined with the fget command. The original data was stored in the 28 double words beginning at location 8.

fget 8 28d

x:	3.141593
x+0x8:	6.283185
x+0x10:	9.424778

... etc. ...

x+0xc8:	81.681409
x+0xd0:	84.823002
xtop:	87.964594

As expected, the initial integer values have all been multiplied by π.
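
The displayed values can be confirmed independently. The sketch below assumes, as the output suggests but the text does not state outright, that the datafile holds the integers 1 through 28 in x and the value π in a. It prints each expected product at its byte offset from x, using six decimal places as fget does in the listing above. The function name `expected_values` is hypothetical.

```python
import math

DOUBLE_BYTES = 8  # each double precision value occupies 8 bytes

def expected_values(count=28):
    """Expected contents of x after the run, assuming x held 1..count."""
    return [k * math.pi for k in range(1, count + 1)]

for index, value in enumerate(expected_values()):
    offset = index * DOUBLE_BYTES
    label = "x" if offset == 0 else f"x+{offset:#x}"
    print(f"{label}:\t{value:.6f}")
```

The final entry, at offset 0xd8 from x, corresponds to the label xtop and carries the value 87.964594 that was seen entering F4 during the first iteration.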